Title¶
Using a K-NN Classification Model to Predict the Genre of a given Song based on Danceability and Energy
Introduction¶
Today, listening to music is more accessible than ever. Popular streaming platforms like Spotify make it easy for users to discover new music genres and receive recommendations aligned with their music preferences (Ignatius Moses Setiadi et al., 2020). Music recommendations play a crucial role in helping users find songs tailored to their tastes, which often involves classifying music genres with a variety of classifiers (Ignatius Moses Setiadi et al., 2020). The enjoyment of a song can depend on various factors, such as emotional impact, catchy melodies, or impactful lyrics (Khan et al., 2022). Additionally, audio features like loudness, tempo, or energy can be used to classify a song's genre, and are often used by music streaming platforms to recommend new songs to their users (Khan et al., 2022).
Based on this information, the question we want to answer with our project is: "What is the genre of a given song based on its danceability and energy values?" This is a classification question, which uses one or more variables to predict the value of a categorical variable of interest. We will be using the K-nearest neighbors (KNN) algorithm to predict the genre of our chosen song. KNN predicts the class of a test observation by calculating the Euclidean distance between the test observation and all the training points (Taunk et al., 2019). The test observation is assigned to the class most common among its K nearest neighbors, with 'K' being the number of neighbors that must be considered (Taunk et al., 2019). The best value of K depends on the dataset and is not always the largest value, because other undesired points may get included in the neighborhood and blur the classification boundaries (Taunk et al., 2019). The dataset we will be using is "Dataset of songs in Spotify" from Kaggle. This dataset has 22 columns, including: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, type, id, uri, track_href, analysis_url, duration_ms, time_signature, genre, song_name, and title. The full list of genres includes Trap, Techno, Techhouse, Trance, Psytrance, Dark Trap, DnB (drum and bass), Hardstyle, Underground Rap, Trap Metal, Emo, Rap, RnB, Pop, and Hiphop. We will be using danceability (from 0-0.99), energy (from 0-1), and the genres Emo, Hardstyle, and Hiphop in our project.
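To make the distance-and-vote idea concrete, here is a minimal base-R sketch of the KNN rule. All of the (energy, danceability) points and labels below are made up for illustration; none of them come from the actual dataset:

```r
# Hypothetical training points: (energy, danceability) coordinates plus a genre label
train <- data.frame(
  energy       = c(0.90, 0.85, 0.60, 0.55, 0.75),
  danceability = c(0.45, 0.50, 0.70, 0.75, 0.48),
  genre        = c("hardstyle", "hardstyle", "Hiphop", "Hiphop", "Emo")
)
test_point <- c(energy = 0.58, danceability = 0.72)

# Euclidean distance from the test point to every training point
dists <- sqrt((train$energy - test_point["energy"])^2 +
              (train$danceability - test_point["danceability"])^2)

# Majority vote among the K = 3 nearest neighbours
k <- 3
nearest <- train$genre[order(dists)[1:k]]
predicted <- names(which.max(table(nearest)))
predicted  # "Hiphop": the test point sits closest to the two Hiphop points
```

The tidymodels workflow used below does the same thing at scale, with standardization of the predictors, cross-validation, and tuning of K handled for us.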
Expected outcomes and significance¶
What do you expect to find?
- Correlations between genre and either energy or danceability. For example, we might find that hiphop may have higher danceability scores while hardstyle may have lower danceability scores.
- Correlations between genre and both energy and danceability. For example, we expect to find that songs with relatively higher danceability and energy are more likely to be hiphop songs while lower danceability and energy scores are more likely to be emo songs.
What impact could such findings have?
- These findings can improve the music that is recommended to users in music apps. By exploring the user's preference for danceability and energy in music, the app can better recommend more personalized music based on these trends.
- These findings can help users find music for different occasions. Users can use this information to select the appropriate music or genre for different occasions.
What future questions could this lead to?
- This project can lead to thinking about how to make a more accurate genre-predicting model. We can consider and incorporate more features of music that influence genre to make a more comprehensive and accurate model.
- This project can also lead us to be curious about other trends with these variables, such as how genres have evolved over time in terms of danceability and energy.
Methods¶
Using the "Dataset of songs in Spotify" dataset, we will conduct a K-NN classification on specific songs within the dataset to predict their genre, using "danceability" and "energy" as the predictor variables and "genre" as the response variable. We will first filter our dataset down to the three variables danceability, energy, and genre, tidy the data, and further shrink it by selecting only the three genres Emo, hardstyle, and Hiphop. We will then set aside specific observations from the data for which our classifier will predict the genre. Next, we will build, tune, and evaluate our K-NN classification model. This includes dividing the data into a training set and a testing set, using the training set to build and tune our model through cross-validation, and evaluating our chosen K value using the testing set. Finally, we will use this classification model to predict the genres of the songs we initially set aside and graph the data using a scatterplot. This scatterplot will place energy on the x-axis and danceability on the y-axis, with colour coding for each of the three genres as well as a distinct colour indicator for the observations we are predicting.
Data analysis¶
library(readr)
library(repr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
options(repr.matrix.max.rows = 10)
urlfile <- "https://raw.githubusercontent.com/brandonzchen/GroupProjDSCI/main/genres_v2.csv"
mydata <- read_csv(urlfile)
Rows: 42305 Columns: 22
── Column specification ──
Delimiter: ","
chr (8): type, id, uri, track_href, analysis_url, genre, song_name, title
dbl (14): danceability, energy, key, loudness, mode, speechiness, acousticne...
# Summarize the number of songs and the mean energy and danceability for each genre
datainformation <- mydata |>
select(danceability, energy, genre) |>
filter(genre %in% c("Emo", "hardstyle", "Hiphop")) |>
group_by(genre) |>
summarise(count = n(),
mean_energy = mean(energy),
mean_danceability = mean(danceability))
datainformation
| genre | count | mean_energy | mean_danceability |
|---|---|---|---|
| <chr> | <int> | <dbl> | <dbl> |
| Emo | 1680 | 0.7611750 | 0.4936988 |
| Hiphop | 3028 | 0.6544179 | 0.6989818 |
| hardstyle | 2936 | 0.8962384 | 0.4780270 |
First, to load our dataset into R, we uploaded it to a public GitHub repository so that anyone can download the file and run the code without issue. Using read_csv, we load the data and store it as "mydata". We then selected three genres from our dataset to start our analysis with the KNN classification method. We begin by collecting general information about the energy and danceability of these genres, as well as the number of songs our dataset contains for each. To do this, we select the danceability, energy, and genre columns, as these are the only three columns relevant to our analysis. We then filter the resulting dataset to include only songs from the genres "Emo", "hardstyle", and "Hiphop". Finally, we use group_by and summarise to get the number of songs in each genre and the mean danceability and mean energy for each genre. This gives us a preliminary look at the spread of the data points.
song_data <- mydata |>
select(danceability, energy, genre) |>
filter(genre %in% c("Emo", "hardstyle", "Hiphop")) |>
mutate(genre = as_factor(genre)) |>
drop_na()
genre_plot <- song_data |>
ggplot(aes(x = energy, y = danceability)) +
geom_point(alpha = 0.4, aes(colour = genre)) +
ggtitle("Figure 1: Scatterplot of the Genres based on Energy and Danceability") +
xlab("Energy") +
ylab("Danceability") +
labs(colour = "Genre") +
theme(text = element_text(size = 18))
options(repr.plot.width = 10, repr.plot.height = 8)
genre_plot
We then once again select and filter all the data and use a scatter plot to represent the data. With energy on the x-axis and danceability on the y-axis, all of the data points are plotted and colour-coded by genre. This scatterplot gives us a better representation of the overlap of the genres and which genres have higher or lower energy and danceability.
set.seed(2023)
song_split <- initial_split(song_data, prop = 0.75, strata = genre)
song_train <- training(song_split)
song_test <- testing(song_split)
knn_recipe <- recipe(genre ~ energy + danceability, data = song_train) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
knn_vfold <- vfold_cv(song_train, v = 5, strata = genre)
k_vals <- tibble(neighbors = seq(from = 75, to = 100, by = 5))
knn_results <- workflow() |>
add_recipe(knn_recipe) |>
add_model(knn_spec) |>
tune_grid(resamples = knn_vfold, grid = k_vals) |>
collect_metrics()
accuracies <- knn_results |>
filter(.metric == "accuracy")
k_vs_accuracy_plot <- accuracies |>
ggplot(aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Neighbors", y = "Estimated Accuracy") +
ggtitle("Figure 2: Plot of Number of Neighbours vs Estimated Accuracy") +
theme(text = element_text(size = 15)) +
scale_x_continuous(breaks = seq(75, 100, by = 5))
options(repr.plot.width = 10, repr.plot.height = 8)
k_vs_accuracy_plot
To start our KNN classification, we split our data using a 75-25 training-testing split. We create a KNN recipe with energy and danceability as our predictors, using step_scale and step_center to standardize them. We then use nearest_neighbor to declare that we will be doing a KNN classification. Using vfold_cv with v = 5, we perform 5-fold cross-validation. We then use tibble to create candidate K values from 75 to 100, stepping by 5, to find the number of neighbors that yields the best accuracy. Putting the recipe, model, and 5-fold cross-validation together using workflow(), we filter the results by accuracy and plot the estimated accuracy against the number of neighbors to find which value gives the highest accuracy. We find from the plot that 80 neighbors yields the highest accuracy, so we choose 80 neighbors for the classification process. Note: this tuning process takes around 3 minutes.
set.seed(2023)
song_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 80) |>
set_engine("kknn") |>
set_mode("classification")
song_fit <- workflow() |>
add_recipe(knn_recipe) |>
add_model(song_spec) |>
fit(data = song_train)
song_test_predictions <- predict(song_fit, song_test) |>
bind_cols(song_test) |>
metrics(truth = genre, estimate = .pred_class) |>
filter(.metric == "accuracy")
song_test_predictions
| .metric | .estimator | .estimate |
|---|---|---|
| <chr> | <chr> | <dbl> |
| accuracy | multiclass | 0.7247514 |
After finding the most accurate number of neighbors, we can start our KNN classification and predictions on our testing split. We create a new model specification using nearest_neighbor, this time with neighbors set to 80. We evaluate the model by predicting the genres of the test set and then using metrics to compute the accuracy. We find that using 80 neighbors gives an accuracy of around 72.5%.
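Beyond overall accuracy, yardstick's conf_mat() would show which genres the model confuses with each other. Here is a minimal sketch using made-up truth and prediction labels; the real analysis would instead bind the output of predict(song_fit, song_test) to song_test and call conf_mat() on that:

```r
library(yardstick)

# Toy truth/prediction labels purely to illustrate conf_mat();
# these are NOT the model's real predictions.
genre_levels <- c("Emo", "Hiphop", "hardstyle")
toy <- data.frame(
  genre       = factor(c("Emo", "Emo", "Hiphop", "Hiphop", "hardstyle"),
                       levels = genre_levels),
  .pred_class = factor(c("Emo", "Hiphop", "Hiphop", "Hiphop", "hardstyle"),
                       levels = genre_levels)
)
toy_conf <- conf_mat(toy, truth = genre, estimate = .pred_class)
toy_conf  # diagonal entries are correct predictions; off-diagonals are confusions
```

On the real test set, this breakdown would reveal, for example, whether most of the roughly 27.5% of errors come from confusing Emo with Hiphop or from some other pair of genres.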
song_recipe <- recipe(genre ~ energy + danceability, data = song_data) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
song_fit_real <- workflow() |>
add_recipe(song_recipe) |>
add_model(song_spec) |>
fit(data = song_data)
new_song_1 <- tibble(energy = 0.29, danceability = 0.56)
new_song_2 <- tibble(energy = 0.889, danceability = 0.628)
new_song_3 <- tibble(energy = 0.84, danceability = 0.75)
new_song_1_predicted <- predict(song_fit_real, new_song_1)
new_song_2_predicted <- predict(song_fit_real, new_song_2)
new_song_3_predicted <- predict(song_fit_real, new_song_3)
new_song_1_predicted
new_song_2_predicted
new_song_3_predicted
| .pred_class |
|---|
| <fct> |
| Emo |
| .pred_class |
|---|
| <fct> |
| hardstyle |
| .pred_class |
|---|
| <fct> |
| Hiphop |
Now that we have trained our model and found that 80 neighbors gives the most accurate classification, we can demonstrate the model by predicting the genre of 3 songs based on their energy and danceability. We pick 3 songs and set their danceability and energy values using tibble. Then, using our model, we predict the genre of each of these three songs with the KNN classification method.
new_songs_predicted_plot <- song_data |>
ggplot(aes(x = energy, y = danceability)) +
geom_point(alpha = 0.4, aes(colour = genre)) +
xlab("Energy") +
ylab("Danceability") +
labs(colour = "Genre") +
theme(text = element_text(size = 12)) +
geom_point(aes(x = 0.29, y = 0.56), color = "black", size = 4) +
geom_point(aes(x = 0.889, y = 0.628), color = "purple", size = 4) +
geom_point(aes(x = 0.84, y = 0.75), color = "brown", size = 4) +
ggtitle("Figure 3: Scatterplot of Genres based on Energy and Danceability with New Song Predictions")
options(repr.plot.width = 10, repr.plot.height = 8)
new_songs_predicted_plot